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Abstract 

In this paper, we consider the imphcations of the fact that parallel raw-power can be 
exploited by a generic Metropolis-Hastings algorithm if the proposed values are independent 
from the current value of the Markov chain. In particular, we present improvements to the 
independent Metropolis-Hastings algorithm that significantly decrease the variance of any 
estimator derived from the MCMC output, at a null computing cost since those improve- 
ments are based on a fixed number of target density evaluations that can be produced in 
parallel. The techniques developed in this paper do not jeopardize the Markovian conver- 
gence properties of the algorithm, since they are based on the Rao-Blackwell principles of 



Gelfand and Smith (1990), already exploited in Casella and Robert (1996), Atchade and 



Perron (2005) and Douc and Robert (2011 ). We illustrate those improvements both on a toy 



normal example and on a classical probit regression model, but stress the fact that they are 
applicable in any case where the independent Metropolis-Hastings is applicable. 
Keywords: MCMC algorithm, independent Metropolis-Hastings, parallel computation, 
Rao-Blackwellization, permutation. 

1 Introduction 

The Metropolis-Hastings (MH) algorithm provides an iterative and converging scheme to sample 
from a complex target density vr. Each iteration of the algorithm generates a new value of the 
Markov chain that relies on the result of the previous iteration. The underlying Markov principle 
is well-understood and leads to a generic convergence principle as described, e.g., in [Robert and 



Casella (2004). However, due to its Markovian nature, this algorithm is not straightforward to 



parallelize, which creates difficulties in slower languages like R (R Development Core Team 2006) 



Nevertheless, the increasing number of parallel cores that are available at a very low cost drives 
more and more interest in "parallel-friendly" algorithms, that is, in algorithms that can benefit 
from the available parallel processing units on standard computers (see. e.g.. Holmes et al. (2011), 



Lee et al.| ( |2009[ ), [Suchard et al.| ( |2010[ )). 

Different techniques have already been used to enhance some degree of parallelism in generic 
Metropolis-Hastings (MH) algorithms, beside the basic scheme of running p MCMC algorithms 
independently in parallel and merging the results. For instance, a natural entry is to rely on 



renewal properties of the Markov chain (Mykland et al. 1995 Robert 1995 Hobert et al. 2002) 
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waiting for all p chains to exhibit a renewal event and then using the blocks as iid, but the 
constraint of Markovianity cannot be removed. Rosenthal (2000) also points out the difficult issue 



of accounting for the burn-in time: while, for a single MCMC run, the burn-in time is essentially 
negligible, it does create a significant bias when running parallel chains (unless perfect sampling 
can be implemented). Craiu and Meng (2005) mix antithetic coupling and stratification with 



perfect sampling. Using a different approach, Craiu et al. (2009) rely on p parallel chains to build 
an adaptive MCMC algorithm, considering in essence that the product of the target densities over 
the chains is their target, a perspective that obviously impacts the convergence properties of the 
multiple chain. Corander et al. (2006) take advantage of parallelization to build a non-reversible 
algorithm that can avoid the scaling effect of specific neighborhood structures, hence focussing on 
a very special type of problem. 

A particular family of MH algorithm is the Independent Metropolis-Hastings (IMH) algorithm, 
where the proposal distribution (and hence the proposed value) does not depend on the current 
state of the Markov chain. Due to this characteristic, this specific algorithm is easier to parallelize 
and can therefore be considered as a good building block toward efficient parallel Markov Chain 
Monte Carlo algorithms, as will be explained in Section |2j We will focus on cases where the 
computation of the likelihood function constitutes the major part of the execution time in the MH 
algorithm. A most realistic example of this setting is provided in Wraith et al. (2009), where the 
model is based on a very complex Fortran program translating the results of several cosmological 
experiments, hence highly demanding in computing time. In this model. Wraith et al. (2009) use 
adaptive importance sampling and massive parallelization, rather than MCMC. 

The fundamental idea in the current paper is that one can take advantage of the parallel abili- 
ties of arbitrary items of computing machinery, from cloud computing to graphical cards (GPU), in 
the case of the generic IMH algorithm, producing an output that corresponds to a much improved 
Monte Carlo approximation machine at the same computational cost. The techniques presented 



here are related with those explained in Perron ( 1999 ) and more closely to those in Atchade and 



Perron (2005, Section 3.1), since these authors condition upon the order statistic of the values pro- 



posed by the IMH algorithm, although in those earlier papers the links with parallel computation 
were not established and hence the implementation of the Rao-Blackwellization scheme became 
problematic for long chains. 

The plan of the paper is as follows: the standard IMH algorithm is recalled in Section |2| 
followed by a description of our improving scheme, called here "block Independent Metropolis- 
Hastings" (block IMH). This improvement depends on a choice of permutations on {1, ... ,p} that 
is described in details in Section [H We demonstrate the connections between block IMH and 
Rao-Blackwellization in Section |4j Results for a toy example are presented throughout the paper 
and a realistic probit regression example is described in Section |5] as an illustration of the method. 



2 Improving the IMH algorithm 
2.1 Standard IMH algorithm 

We recall here the basic IMH algorithm, assuming the availability of a proposal distribution that 
we can sample, and which probability density is known up to a normalization constant. The 
independent Metropolis-Hastings algorithm, described in Algorithm [l} generates a Markov chain 
with invariant density vr, corresponding to the target distribution. 
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Algorithm 1 IMH algorithm 
1: Set xo to an arbitrary value 
2: for t = 1 to T do 

3: Generate yt ^ 

4: Compute the ratio: 

' fi{yt) 7r(xt_i) J 

5: Set xt = yt with probability p{xt-i,yt)', otherwise set Xt = Xt-i 
6: end for 



p{xt-i,yt) = min 



In the larger picture of Monte Carlo and MCMC algorithms, the IMH algorithm holds a rather 



special status. It has certainly been studied more often than other MCMC schemes (Robert and 



Casella, 2004), but it is undoubtedly a less practical solution than the more generic random walk 



Metropolis-Hastings algorithm. For instance, it is rather rarely used by itself because it requires 
the derivation of a tolerably good approximation to the true target, approximation that most 
often is unavailable. On the other hand, first-order approximations and Metropolis-within-Gibbs 
schemes are not foreign to calling for IMH local moves based on Gaussian representations of 
the targets. The reason theoretical studies of the IMH algorithm abound is that it has strong 
links with the non-Markovian simulation methods such as importance sampling. Contrary to 
random-walk Metropolis-Hastings schemes, IMH algorithms may enjoy very good convergence 
properties and may also reach acceptance probabilities that are close to one. Furthermore, the 
potentially large gain in variance reduction provided by the parallelization scheme developped in 
this paper may counteract the lesser efficiency of the original IMH compared with the random 
walk Metropolis-Hastings algorithm. 

An important feature of the IMH algorithm, when addressing parallelism, is that it cannot work 
but in an iterative manner, since the outcome of step t, namely the value Xt, is required to compute 
the acceptance ratio at step t + 1. This sequential construction is compulsory for the validation of 
the algorithm given the Markov property at its core (Robert and Casella, 2004). Nonetheless, given 
that, in the IMH algorithm, the proposed values (yt) are generated independently from the current 
state of the Markov chain, xt, it is altogether possible to envision the generation of T proposed 
values yt first, along with the computation of the associated ratios Ut = n{yt)/ fi{yt)- Once this 
computation requirement is completed, only the acceptance steps need to be considered iteratively. 
This two-step perspective makes for a huge saving in computing time when the simulation of the 
l/t's and the derivation of the UtS can be achieved in parallel since both the remaining computation 
of the ratios p{xt-i, yt) given the Uts and their subsequent comparison with uniform draws typically 
are orders of magnitude faster. 

In this respect the IMH algorithm strongly diff^ers from the random walk Metropolis-Has- 
tings (RWMH) algorithm, for which acceptance ratios cannot be processed beforehand because 
the proposed simulated values depend on the current value of the Markov chain. The universal 
availability of parallel processing schemes may thus lead to a new surge of popularity for the IMH 
algorithm. Indeed, when taking advantage of p parallel processing units, an IMH can be run for p 
times as many iterations as RWMH, at almost the same computing cost since RWMH cannot be 
directly parallelized. 

In order to better describe this increased computing power, we first note that, once T successive 
values of a Markov chain have been produced, the sequence is usually processed as a regular 
Monte Carlo sample to obtain an approximation of an expectation under the target distribution. 
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[^(-^)] say, for some arbitrary functions h. We propose in this paper a technique that improves 
the precision of the estimation of this expectation by taking advantage of parallel processing units 
without jeopardizing the Markov property. 

Before presenting our improvement scheme, we introduce the notation V (read "or") for the 
operator that represents a single step of the IMH algorithm. Using this notation, given the current 
value Xt and a sequence of p independent proposed values yi, . . . , ?/p ~ /x, the IMH algorithm goes 
from step t to step t + p according to the diagram in Figure [1} 

Xt xt+i := Xt V yi ^ Xt+2 ■= Xt+i V y2 • ■ ■ Xt+p := Xt+p-i V yp 

Figure 1: IMH steps between iteration t and iteration t + p. 

2.2 Block IMH algorithm 

We propose to take full advantage of the simulated proposed values and of the computation of their 
corresponding u ratios. To this effect, we introduce the block IMH algorithm, made of successive 
simulations of blocks of size p x p. In this alternative scheme, the number of blocks b is such 
that the number of desired iterations T is equal to b * p, in order to keep the comparison with a 
standard IMH output fair. Usually p needs not be calibrated since it represents the number of 
physical parallel processing units that can be exploited by the code. However, in principle, this 
number p can be set arbitrarily high and based on virtual parallel processing units, the drawback 
being then an increase in the computing cost. (Note that the block IMH algorithm can also be 
implemented with no parallel abilities, still it provides a gain in variance that may counteract the 
increase in time.) In the following examples, we take p varying from 4 to 100. We first explain 
how a block is simulated, and then how to move from one block to the next. 

A p X p block consists in the generation of p parallel generations of p values of Markov chains, 
all starting at time t from the current state Xt and all based on the same proposed simulated 
values yi, . . . ,yp. The different between the p fioes is the orders in which those ?/j's are included. 
For instance, these orders may be the p circular permutations of yi,. . . ,yp, or they may be in- 
stead random permutations, as discussed in detail (and compared) in Section [3] The block IMH 
algorithm is illustrated in Figure |2] for the circular set of permutations. 




p X p block 



Figure 2: Block simulation from step t + 1 to step t + p. Here, circular permutations of the 
proposed values are used for illustration purposes. 
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It should be clear that each of the p parallel chains found in this block is a valid MCMC 
sequence of length p when taken separately. As such, it can be processed as a regular MCMC 
output. In particular, if Xt is simulated from the stationary distribution, any of the subsequent 
x^^^j^ is also simulated from the stationary distribution. However, the point of the p parallel flows 
is double: 

• it aims at integrating out part of the randomness resulting from the ancillary order in which 
the i/fc's are chosen, getting close to the conditioning on the order statistics of the y^s 



advocated by Perron (1999); 



• it also aims at partly integrating out the randomness resulting from the generation of uni- 
form variables in the selection process, since the block implementation results in drawing p^ 
uniform realizations instead of p uniform realizations for a standard IMH setting. 

Both of those points essentially amount to implementing a new Rao-Blackwellization technique 
(a more precise connection is drawn in Section |4]). In an independent setting, each of the y^'s 
occurs a number > of times across the p steps of the p parallel chains, i.e. for a number p^ 
of realizations. Therefore, when considering the standard estimator fi of based on a 

single MCMC chain. 



1 

ri{xt,yi:p) = - y^h{xt+k} 
^ k=i 

this estimator necessarily has a larger variance than the double average 



^ p p 1 ^ 

f2{xt, yi:p) = -o^Yl ^(^Sfe) = Z^Yl ^kKyk) 



^ j=l k=l ^ k=0 

where yo := Xt and uq is the number of times Xt is repeated. (The proof for the reduction of the 
variance from fi to 7^2 easily follows from a double integration argument.) We again insist on the 
compelling feature that computing 7^2 using p parallel processing units does not cost more time 
than computing fi using a single processing unit. 

In order to preserve its Markov validation, the algorithm must properly continue at time t+p. 
An obvious choice is to pick one of the p sequences at random and to take the corresponding x['2p 
as the value of Xt+p, starting point of the next parallel block. This mechanism is represented in 
Figure [3j While valid from a Markovian perspective, since the sequences are marginally produced 
by a regular IMH algorithm, this means that the chain deduced from the block IMH algorithm is 
converging at exactly the same speed as the original IMH algorithm. An alternative choice for the 
starting points of the blocks takes advantage of the weights on the y^s that are computed via 
the block structure. Indeed, those weights essentially act as importance weights and they allow for 
a selection of any of the 3;|:^\'s as the starting point of the incoming block, which corresponds 
to choosing one of the proposed y^s with probability proportional to n^. While this proposal does 
reduce the length of the resulting chain, it does not impact the estimation aspects (which still 
involve all of the p"^ values) and it could improve convergence, given that the weighted y^s behave 
like a discretized version of a sample from the target density tt. We will not cover this alternative 
any further. 

The original version of the block IMH algorithm is described in Algorithm [2| The algorithm 
is made of a loop on the b blocks and an inner loop on the p parallel chains of each block. The p 
steps of this inner loop are actually meant to be implemented in parallel. The output of Algorithm 
HJis double: 
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p y. p block 




p X p block 



p X p block 



time 



t 



t + p 



t + 2p + 1 



Figure 3: The block IMH algorithm runs p parallel chains during p steps, then picks one of the 
final values (represented by the black squares) and iterates. (An alternative transition mechanism 
involves sampling randomly one of the p^ terms within the block.) 



a standard Markov chain of length T, which is made of h chains of length p, each of which 
is chosen among p chains at line 12 of Algorithm [2| 



a. p X T array {Xf)^^l.'^, on which the estimator 7^2 is based. 



2.3 Savings on computing time 

Since the point-wise evaluation of the target density 7[{yk) is usually the most computer-intensive 
part of the algorithm, sampling additional uniform variables has a negligible impact here, as do 
further costs related to the storage of vectors larger than in the original IMH. This is particularly 
compelling since the multiple chains do not need to be stored further than during a single block 
execution time. That is why, although we sample p times more uniforms in the block IMH 
algorithm, we still consider it to be running at roughly of the same cost as the IMH algorithm. 
The number of target density evaluations indeed is the same for both and most often represent 
the overwhelming part of the computing time in the Metropolis-Hastings algorithm. Besides, 
pseudo-random generation of uniforms can also benefit from parallel processing, see e.g. [L'Ecuyer 



et al. (2001) 



In the following Monte Carlo experiment, various versions of the block IMH algorithm are 
compared one to another, as well as to standard IMH and importance sampling. We stress that 
a straightforward reason for not conducting a comparison with a plain parallel algorithm based 
on p independent parallel chains is that it does not make much sense cost-wise. Indeed, running 
p parallel MCMC chains of the same length T does cost p times more in terms of target density 
evaluations. Obviously, if one insists on running p independent chains, for instance as to initialize 
an MCMC algorithm from several well-dispersed starting points, each of those chains can benefit 
from our stabilizing method, which will improve the resulting estimation. 

The method is presented here for square blocks of dimension {p,p), but blocks could be rect- 
angular as well: the algorithm is equally valid when using r ^ p permutations, leading to (r, p) 
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Algorithm 2 block IMH algorithm 



1: Set xo to an arbitrary value, compute loq 

2: Set Xstart = Xo, Wstart = '^O 

3: Set a block size p, and a number of blocks b, such that b * p = T 

4: Generate all proposed values yi, . . . , ?/r ~ 

5: Compute all ratios ui, . . . ,ujt 

6: for z = 1 to 6 do 

7: Choose p permutations cxi, . . . , CTp 

8: for A; = 1 to p do 

9: Run p steps of an IMH given: 

• (^^startj '^start) 

• p proposed values i/(i_i)*p+i, . . . , yi*p shuffled with the permutation ak 

• the p corresponding ratios u/s 

10: Save as a^(^li)^p+i, • • • , xl'^l the resulting chain 
11: end for 

12: Draw an index j uniformly in {1, . . . ,p}, set Xstart 

ratio u. 
13: end for 



= xl ' set cjstart as the corresponding 



blocks. We focus here on square blocks because when the machine at hand provides p parallel 
processing units, then it is most efficient to simulate the proposed values and the uniforms, and 
to compute the target densities and the acceptance ratios at the p proposed values in parallel. 
Once again, the block IMH algorithm with p x p square blocks has about the same cost as the 
original IMH algorithm, because computing target densities and acceptance ratios does more than 
compensate for the cost of randomly picking an index at the end of each block. This amounts to 
say that line |4] of Algorithm [l] and line [s] of Algorithm |2] are (by far) the most computationally 
demanding ones in the respective algorithms. 



2.4 Toy example 

We now introduce a toy example that we will follow throughout the paper. The target vr is the 
density of the standard 7\^(0, 1) normal distribution and the proposal /i is the density of the C(0, 1) 
Cauchy distribution. Hence, the density ratio is 

uj[x) = — — — oc (1 + x ) exp {—-X ) 

We only consider the integral J X7r(dx), the mean of vr equal to zero in this case. The acceptance 
rate of the IMH algorithm for this example is around 70%. (Note that IMH with higher acceptance 
rates are considered to be more efficient, in opposition to other Metropolis-Hastings algorithms, 
Robert and Casella| |2004[ ) 



see 



In all results related to the toy example presented thereafter, 10, 000 independent runs are 
used to compute the variance of the estimates. The value of p represents the number of parallel 
processing units that are available, ranging from 4 for a desktop computer to 100 for a cluster or 
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a graphics processing unit (GPU) (this value could even be larger for computers equipped with 
multiple CPUs). 

The results of the simulation experiments are presented in Figures |4]-[8] as barplots, which 
indicate the percentage of variance decrease associated with the estimators under comparison, the 
reference estimator being the standard IMH output fi. In agreement with the block sampling 
perspective, the same proposed values and uniform draws were used for all the estimators that are 
plotted on the same graph (that is, for a given value of p), so that the comparison is not perturbed 
by an additional noise associated with the simulation. 



3 Permutations 

While the choice of permutations in line [7] of Algorithm |2] is irrelevant for the validation of the 
parallelization, it has important consequences on the variance improvement and we now discuss 
several natural choices. The idea of testing various orders of the proposed values in a IMH 



algorithm appeared in Atchade and Perron (2005) where the permutations were chosen to be 



circular. We first list natural types of permutations along with some justifications, and then we 
empirically compare their impact on estimation performances for the toy example. 

3.1 Five natural permutations 

Let S be the set of permutations of {1,. . . Its size is p\, therefore too large to allow for 
averaging over all permutations, although this solution would be ideal. We consider the simpler 
option of finding p efficient permutations in S, denoted by (ai, . . . , cXp), the goal being a choice 
favoring the largest possible decrease in the variance of the estimator 7^2 defined in Section [2j 

3.1.1 Same order 

The most basic choice is to pick the same permutation on each of the p chains: 

ai = a2 = ■ ■ ■ = ap 

This selection may sound counterproductive, but we still obtain a significant decrease in the vari- 
ance of f2 using this set of permutations, when compared with fi. The reason for the improvement 
is that p times more uniforms are used in f2 than in fi, leading to a natural Rao-Blackwellization 
phenomenon that is studied in details in Section |4j Nonetheless this simplistic set of permutations 
is certainly not the best choice since it does not integrate out the ancillary randomness resulting 
from the arbitrary ordering of the proposed values. 

3.1.2 Circular permutations 

Another simple choice is to use circular permutations. For 1 < i < p, we define 

ai(l) = i, ai{2) = i + 1, . . . , ai{p -i + l)=p, cri{p - i + 2) = 1, . . . ,ai{p) = i - 1. 

An appealing property of the circular permutations is that each simulated value yk is proposed 
and evaluated at a different step for each chain. However, a drawback is that the order is not 
deeply changed: for instance yk-i will always be proposed one step before yk except for one of the 
p chains, for which yk is proposed first. 
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3.1.3 Random permutations 



A third clioice is to use random orders, tliat is random sliuffiings of the sequence {1, . . . ,p}. We can 
either draw those random permutations with or without replacement in the set S, but considering 
the cardinality of the set S this does not make a large difference. Indeed, it is unlikely to draw 
twice the same permutation, except for very small values of p. 



3.1.4 Half-random half-reversed permutations 

A slightly different choice of permutations consists in drawing p/2 permutations at random {p is 
taken to be even here to simplify the notations). Then, denoting the first p/2 permutations by 
ai, . . . , (Jp/2, we define for 1 < A; < p/2: 

ak+p/2{l) = (^k{p), o-fc+p/2(2) = cTkip - 1), . . . ak+p/2{p) = crk{l) ■ 

The motivation for this inversion of the orders is that, in the second half of the permutations, 
the opposition with the "reversed" first half is maximal. This choice, suggestion of Art Owen 
(personal communication), aims at minimizing the possible common history among the p parallel 
chains. Indeed two chains with the same proposed values in reverse order cannot have a common 
path of length more than 1. 



3.1.5 Stratified random permutations 

Finally we can try to draw permutations that are far from one another in the set S. For instance 
we can define the lexicographic order on S, draw indices from a low discrepancy sequence on the 
set {!,... ,p\} and select the permutations corresponding to these indices. In a simpler manner, 
we do use here a stratified sampling scheme: we first draw a random permutation conditionally 
on its first element being 1, then another permutation beginning with 2, and so forth until the 
last permutation which begins with p. 



3.2 A Monte Carlo comparison 

We compare the five described types of permutations on the toy example. Figure |4] shows the 
results for various values of p, displaying the variance reduction of f2 associated with each of the 
permutation orders, compared to the variance of the original IMH estimator fi. For each of the 
10, 000 independent replications, the block IMH algorithm was launched on one single pxp block, 
e.g. with 6=1 using the notation of Section [2| since b plays no role whatsoever in this comparison. 

As mentioned above, using the same order in the i/kS for each of the p parallel chains already 
produces a significant decrease of about 20% in the variance of the estimators. This simulation 
experiment shows that the three random permutations (random, half-random half-reversed and 
stratified) are quite equivalent in terms of variance improvement and that they are significantly 
better than the circular permutation proposal, which only slightly improves over the "same order" 
scheme. Therefore, in the next Monte Carlo experiments, we will only use the random order 
solution, simplest of the random schemes. An amount of improvement like 35% when p > 32 
is quite impressive when considering that it is essentially obtained cost-free for a computer with 



parallel abilities (Holmes et al. 2011). 
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Figure 4: Variance reductions, when compared with the basic estimator fi, of the various block 
estimators f2 associated with each permutation scheme for several values of p. 

4 Rao— Black welliz at ion 

Another generic improvement that can be brought over classical MH algorithms is Rao-Blackwel- 



Gelfand and Smith 


1990 


Casella and Robert 


1996 



lization methods are presented, one that is computationally free and one that, on the contrary, is 
computationally expensive. We then implement both solutions within the block IMH algorithm 
and explain why the "same order" scheme already improves upon the IMH algorithm. 



4.1 Primary Rao— Blackwellization 



Within the standard IMH algorithm of Section 2.1, a cost- free improvement can be obtained by 



a straightforward Rao-Blackwellization argument. Given that at step t + i, Ui is accepted with 
probability p{xt+i-i,yi) and rejected with probability 1 — p{xt+i-i,yi), the weight of i/i can be 
updated by p{xt+i-i,yi) and the weight of the simulated value yj corresponding to Xt+i-i can be 
similarly updated by the probability 1 — p{xt+i-i, yi). Considering next the block IMH algorithm, 
at the beginning of each block we can define p weights, denoted by {wk)\=i-, initialized at and 
then, for the first of the p parallel chains, denoting by j the index such that x^^^^_-^ = yj, we update 
these weights at each time t + i as 

Wj ^ Wj + 1 -p(a;S+\^i,2/i) 
Wi ^ Wi + p{xl^^i_^,yi) 

This is obviously repeated for each of the other parallel chains, ending up with Y2k '^k = P^- This 
leads to a new estimator 

1 ^ 

>yi:p) = :;^^WkKyk) ■ 

k=0 

This estimator still depends on all uniform generations created within the block, since those 
weights Wk depend upon the acceptances and rejections of the y^s made during the block update. 



p^ 
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However, along the steps of the block, the w^s are basically updated by the expectations of the 
acceptance indicators conditionally upon the results of the previous iterations, whereas the ra^ of 
Section |2] are directly updated according to the acceptance indicators. Hence, the WkS have a 
smaller variance than the n^s by virtue of the Rao-Blackwell theorem, leading to fa necessarily 
having a smaller variance than f2. 

We now discuss a more involved Rao-Blackwellization technique first proposed by [Casella and 



Robert (1996). 



4.2 Block Rao— Blackwellization 



Exploiting the Rao-Blackwellization technique of Casella and Robert (1996) within each par- 



allel chain provides via a conditioning argument an even more stable approximation of arbi- 



(X 



trary posterior quantities. As developed in Casella and Robert (1996), for a single Markov chain 
Xp^), a Rao-Blackwell weighting scheme on the proposed values yt, with weights (ft, is 



given by a recursive scheme 



where (t > 0) 



and 



u=t+l 



it ! 



t-1 

i=o 

associated with the Metropolis-Hastings ratios 

uJt = 7r{yt)/iJ,{yt) , ptu = ^ul^t ■ 

The cumulated computation of the (5t's, of the ptu^ and of the ^tu^ requires an O(p^) computing 
time. Given that p is usually not very large, this additional cost is not redhibitory as in the 



original proposal of Casella and Robert ( 1996 ) who were considering the application of this Rao- 
Blackwellization technique over the whole chain, with a cost of O(T^) (see also Perron, 1999). 



Therefore, starting from the estimator f2, the weight counting the number of occurrences 
of in the p X p block can be replaced with the expected number ip^ of times occurs in this 
block (given the p proposed values), which is the sum of the expected numbers of times yk occurs 
in each of the p parallel chain: 

p 

i=l 

Since the p parallel chains incorporate the proposed values with different orders, the yj's may differ 
for each chain and must therefore be computed p times. Note that the cost is still in O(p^) if this 
computation can be implemented in parallel. Then, by a Rao-Blackwell argument, f2 and T3 are 
dominated by f4 defined as follows: 



T4[Xt,yi:p) 



1 ^ 

- ^ ^kh{yk 

7 n 



k=0 
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Figure 5: Variance improvement over the basic estimator fi for three improved block IMH 
estimators. 

Therefore, this Rao-Blackwelhzation scheme involves no uniform generation for the computation 
of 74: the randomness associated with these uniforms is completely integrated out. 
The four estimators defined up to now can be summarized as follows: 

• f"! is the basic IMH estimator of [h{X)], 

• f2 improves fi by averaging over permutations of the proposed values, and by using p times 
more uniforms than fi, 

• fs improves upon f2 by a basic Rao-Blackwell argument, 

• improves upon the above by a further Rao-Blackwell argument, integrating out the ancil- 
lary uniform variables, but at a cost of O(p^). 

Note that these four estimators all involve the same number p of target density evaluations, which 
again represent the overwhelming part of the computing time. 

4.3 A numerical evaluation 

Figure [5] gives a comparison between the variances of the three improved estimators defined above 
and the variance of the basic IMH estimator. The permutations are random in this case. As was 
already apparent on Figure |4| the block estimator f2 is significantly better than fi for any value 
of p. Moreover, both Rao-Blackwellization modifications seem to improve only very slightly the 
estimation when compared with f2, even though the improvement increases with p. 

Recall that the "same order" scheme already provided a significant decrease in the variance of 
the estimation. In the light of our results, our interpretation is that using p parallel chains with 
the same proposed values acts like a "poor man" Rao-Blackwellization technique since p times 
more uniforms are used. Specifically, each of the p proposed values is proposed p times instead of 
once, thus reducing the impact of each single uniform draw on the overall estimation. 
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When we use Rao-Blackwellization on top of the block IMH, in the estimators 7^3 and 74, 
we try indeed to integrate out a randomness that aheady is partly gone. This explains why, 
although Rao-Blackwellization techniques provide a significant improvement over standard IMH, 
the improvement is much lower and thus rather unappealing when used in the block IMH setting. 
This outcome was at first frustrating since Rao-Blackwellization is indeed affordable at a cost of 
only O(p^). However, this shows in fine that the improvement brought by the block IMH algorithm 
roughly provides the same improvement as the Rao-Blackwell solution, at a much lower cost. 

4.4 Comparison with Importance Sampling 

The proposal density /i may also be used to construct directly an importance sampling (IS) 
estimator 



where the values i/t are drawn from /i. It therefore makes sense to compare the IMH algorithm 
with an IS approximation because IS is similarly easy to parallelize, and straightforward to pro- 
gram. Furthermore, since the IS estimator does not involve ancillary uniform variables, it is 
comparable to the Rao-Blackwellized version of IMH, and hence to the block IMH. Obviously, 
IS cannot necessarily be used in the settings when IMH algorithms are used, because the latter 
are also considered for approximating simulations from the target density tt. In particular, when 
considering Metropolis-within-Gibbs algorithms, IS cannot be used in a straightforward manner, 
even for approximating integrals. 

Before giving numerical results for a comparison run on the toy example, we now explain why 
in this comparison we took the number of blocks to be larger than 1. The previous comparisons 
were computed with 6 = 1, i.e. on a single p x p block and for a large number of independent 
runs. The choice of b was then irrelevant since we were comparing methods that were exploiting 
in different ways the p proposed values generated in each block. When considering the block IMH 
algorithm as a whole, as explained in Section |2} the end of each block sees a new starting value 
chosen from the current block. This ensures that the algorithm produces a valid Markov chain. 
However, our construction also implies that the successive blocks produced by the algorithm are 
correlated, which should lead to lesser performances than for the IS estimator. 

In the comparison between IMH and IS, we therefore need to take into account this correlation 
between successive blocks. To this effect, we produce the variance reductions for several values 
of b. Those reductions are presented in Figure [6] for p = IQ and different values of 6 = 1, 10, 100. 
Once again, the permutations in the block IMH algorithm are chosen to be random. 

Figure [6] shows the a priori surprising result that, when selecting 6 = 1 in the experiment, 
the variance results are in favor of the block IMH solutions over the IS estimator, but, for any 
realistic application, b is (much) larger than 1. For all b > 10, the IS estimator has a smaller 
variance than the three alternative block IMH estimators, if only by a small margin. (Note that 
the variance improvement over the original MCMC estimator is slightly increasing with b despite 
the correlation between blocks, given that the correlation between the p^ terms involved in the 
block IMH estimators is lower than the correlation in the original MCMC chain.) This experiment 
thus shows that the block IMH solution gets very close to the IS estimator, while preserving the 
Markovian features of the original IMH algorithm. 
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Figure 6: Variance reduction, when compared with the basic estimator fi, of the three improved 
block IMH estimators, and the IS estimator fjs, for p = 16 and 6 = 1, 10, 100. 



5 A probit regression illustration 

In order to evaluate the performances of the parallel processing presented in this paper on a 



realistic example, we examine its implementation for the probit model already analyzed in Marin 



and Robert (2010) for the comparison of model choice techniques because the "plug- in" normal 
distribution based on MLE estimates of the first two moments works perfectly as an independent 
proposal. 

A probit model can be represented as a natural latent variable model in that, if we consider a 
sample zi, Zn of n independent latent variables associated with a standard regression model, 
i.e. such that Zi\9 ~ TV [xj9, l), where the Xj's are p-dimensional covariates and 9 is the vector of 



regression coefficients, then yi, 



Vr 



such that 



is a probit sample. Indeed, given 9, the yi's are independent Bernoulli rv's with P(?/j = 1\9) = 
$ [xj9^ where $ is the standard normal cdf. The choice of a prior distribution for the probit model 
is open to deb ate, but the above connection with the latent regression model induced [Marin and 



Robert 



( [2007| to suggest a ^f-prior model, 9 ^ M (Op, n[X'^X) ^) , with n as the g factor and X 
as the regressor matrix. 

While a Gibbs sampler taking advantage of the latent variable structure is implemented 

a straightforward 



m 



1993) 



Marin and Robert (2010) and earlier references (Albert and Chib, 
Metropolis-Hastings algorithm may be used as well, based on an independent proposal M{9,c'E), 
where 9 is the MLE estimator, E its asymptotic variance, and c a scaling factor. 



As in Marin and Robert (2010) and Girolami and Calderhead (2010), we use the R Pima Indian 



benchmark dataset (R Development Core Team, 2006), which contains medical information about 



332 Pima Indian women with seven covariates and one explained binary diabetes variable. 

For the purpose of illustrating the implementation of the block IMH algorithm, we only con- 
sider here three covariates, namely plasma glucose concentration (with coefficient ^1), diastolic 
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blood pressure (with coefficient 62) and diabetes pedigree function (with coefficient 63). We are 
interested in the posterior mean of those three regression parameters. In our experiment, we ran 
10, 000 independent rephcations of our algorithm to produce a reliable evaluation of the variance 
of the estimators under comparison. In Figure [7] we present the variance comparison of the four 
estimators described in Section |4| for p = 4 and p = 48 and for each of the three regression 
parameters. In the independent proposal, the scale factor is chosen to be 3 since pilot runs showed 
that it led to an acceptance rate around 37%, with thus enough rejections to exhibit improvement 
by Rao-Blackwellization. 

The results shown in Figure [7] confirm the huge decrease in variance previously observed in 
the toy example. The gains represented in those figures indicate that the block estimator f2 is 
nearly as good (in terms of variance improvement) as the Rao-Blackwellized block estimators 
and f4, especially when p moves from 4 to 48. This confirms the previous interpretation given 
in Section |4] that the block IMH algorithm provides a cost-free Rao-Blackwellization as well as a 
partial averaging over the order of the proposed values. 

The fact that the toy example showed decreases in the variance that were around 35% whereas 
the probit regression shows decreases around 60% is worth discussing. The amount of decrease 
in the variance differs from one example to the other, but it is more importantly depending on 
the acceptance rate of the Metropolis-Hastings algorithm. In fact, Rao-Blackwellization and 
permutations of the proposed values are useless steps if the acceptance rate is exactly 1. On the 
opposite, it may result in a significant improvement when the acceptance rate is low (since the 
part of the variance due to the uniform draws would then be much more important). 

To illustrate the connection between the observed improvement and the acceptance rate, we 
propose in Figure [8] a variance comparison for p = IQ and two scaling factors c of the proposal 
covariance matrix in the probit regression model. In the previous experiment, we have used c = 3, 
which leads to an acceptance ratio around 37%. Here, if we take c = 1, the acceptance ratio 
rises to 96%, and hence almost all the proposed values are accepted. In this case permuting the 
proposed values and using Rao-Blackwellization techniques does not bring much of a variance 
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T2 T3 r4 T2 Ts T2 T3 T4 

Figure 8: Variance comparison for p = IQ and two scaling factors: c = 1, with an associated 
acceptance rate of 96% (top) and c = 10, with an associated acceptance rate of 8% (bottom). 



decrease (Figure |8| top). On the other hand, if we take c = 10, the acceptance ratio drops down 
to 8% and the observed decrease in variance is huge. In this second case using all the proposed 
values gives much better results than relying on the standard IMH estimator, which is only based 
on 8% of the proposed values that were accepted (Figure [sj bottom). 

The difference observed with this range of scaling factors is thus in agreement with the larger 
decrease in variance observed for the probit regression. This is a positive feature of the block IMH 
method, since in a complex model, the target distribution is most often poorly approximated by 
the proposal and thus the acceptance rate of the IMH algorithm is quite likely to be low. 



6 Conclusion 



The Monte Carlo experiments produced in this paper have shown that the proposed method 
improves significantly the precision of the estimation, when compared with the standard IMH 
algorithm. Beyond these examples, we see multiple situations where the block IMH algorithm 
can be used to improve the estimation in challenging problems. First, as already stated, the IMH 
algorithm can be used in Metropolis- within-Gibbs algorithms (Gilks et al. , 1995). Obviously if a 



single IMH step is performed for each component of the state, then the block IMH technique cannot 
be applied without incurring additional costs. However, it is also correct to update each component 
multiple times instead of once. Furthermore, an uniform Gibbs scan is rarely the optimal way to 
update the components and more sophisticated schemes have been studied, resulting in random 



scan Gibbs samplers and adaptive Gibbs samplers (Latuszynski and Rosenthal, 2010), where the 



probability of updating a given component depends on the component and is learned along the 
iterations of the algorithm. Hence if a component is updated n times more often than another, n 
IMH can be performed in a row, which allows the use of the block IMH technique with p = n. 



IMH steps are also used within sequential Monte Carlo (SMC) samplers (Chopin 2002 Del 
Moral and Doucet, 2003), to diversify the particles after resampling steps. In this context, an 
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independent proposal can be designed by fitting a (usually Gaussian) distribution on the particles. 
If the move step is repeated multiple times in a row, for instance to ensure a satisfying particle 
diversification, then the block IMH algorithm can be used. 

Related to SMC, another context where the variance reduction provided by block IMH might 
be valuable is the class of particle Markov Chain Monte Carlo methods (Andrieu et al. , 2010). For 
these methods, a particle filter is computed at each iteration of the MH algorithm to estimate the 
target density, and hence it is paramount to make the most out of the expensive computations 
involved by those estimates. This is thus a natural framework for parallelization. 

As a final message, the block IMH method is close to being 100% parallel (except for the random 
draw of an index at the end of each block). Since parallel computing is getting increasingly easy to 
use, the free improvement brought by is available for all implementations of the IMH algorithm. 
Furthermore, even without considering parallel computing, since the method uses the most of each 
target density evaluation, it brings significant improvement when computing the target density is 
very costly. In such settings, the cost of drawing instead of p uniform variates is negligible and 
the block IMH algorithm thus runs in about the same time as the standard IMH algorithm. We 
note that the time required to complete a block in the algorithm is essentially the maximum of 
the p times required to calculate the density ratios Wi. Therefore, if these times widely vary, there 
could be a diminishing saving in computation time as p increases for both the standard IMH and 
the block IMH algorithms. Nonetheless, even in such extreme cases, using % in the block IMH 
algorithm would bring a significant variance improvement at essentially no additional cost. 
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